Geospatial distributed processing with furrr

2023-01-23

This is a guest post from Ryan Garnett, Data Management Insights & Analytics Manager at Green Shield Canada. Ryan works to bridge technical teams and senior executives by effectively communicating complex transformational change, achieved through international experience in technology development, operational management, client service, and strategic development. Keep in touch on LinkedIn.

The challenge

Everything happens somewhere. Location is an important factor when people make decisions that influences various life decisions, from buying a house, choosing a school, to vacations, or selecting a restaurant to eat out at. Within the data space utilizing location data is becoming increasingly important in data analysis, data science, and machine learning. Data wrangling is an integral part of data science, typically occupying a significant portion of time and effort, with location data adding to the complexity.

Location data can be large, consisting of hundreds of thousands to billions of observations in a single dataset. With location-based analysis consistently utilizing multiple datasets, the time to complete the analysis can be lengthy. Similar to other data analytics, the majority of location-based analysis tasks are performed sequentially using a single processor on a computer, even though many of the operations performed have no dependency on the other data or their outcome; if only there were another way…well, let me introduce you to distributed location-based processing.

What is distributed processing? In short, distributed processing is the use of multiple computer processors to execute a single computational task. A distributed process is executed by splitting a task into smaller chunks which are simultaneously performed on multiple processors. We can use distributed processing to make our code run faster, sometimes much faster. For large geospatial datasets that require a lot of computational power, distributed processing can reduce the time needed to handle complex data manipulation.

In this post, we’ll explore different techniques for wrangling location data, the processing time required for each, and the significant improvements that can be seen with distributed processing.

Setup

We performed the analysis for the post on a System76 Ubuntu laptop with the following configurations:

                               
1 Operating System Ubuntu 22.04
2            Cores            8
3        CPU speed      1.80GHz
4              RAM         32GB

Load packages

We use a number of packages in this post, some standard data analytics packages ({dplyr}, {janitor}, {ggplot2}, {gt}, and {readr}), as well as a few niche packages ({furrr}, {future}, {leaflet}, and {sf}).

# Importing data
library(readr)

# Data analysis and wrangling
library(dplyr)
library(janitor)
library(sf)

# Distributed processing
library(furrr)
library(future)

# Visualization and styling
library(ggplot2)
library(gt)
library(leaflet)

Helper functions

We will use location data from different sources with different attribute standards. To assist with the data wrangling, we create custom functions to help standardize attributes available within the source datasets.

# Add provincial abbreviation column from provincial name
create_prov_abbrv_from_name <- function(.data, prov_column){
  .data %>%
    dplyr::mutate(PROV = dplyr::case_when(
      {{prov_column}} == "Alberta" ~ "AB",
      {{prov_column}} == "British Columbia" ~ "BC",
      {{prov_column}} == "Manitoba" ~ "MB",
      {{prov_column}} == "New Brunswick" ~ "NB",
      {{prov_column}} == "Newfoundland and Labrador" ~ "NL",
      {{prov_column}} == "Northwest Territories" ~ "NT",
      {{prov_column}} == "Nova Scotia" ~ "NS",
      {{prov_column}} == "Nunavut" ~ "NU",
      {{prov_column}} == "Ontario" ~ "ON",
      {{prov_column}} == "Prince Edward Island" ~ "PE",
      {{prov_column}} == "Quebec" ~ "QC",
      {{prov_column}} == "Saskatchewan" ~ "SK",
      {{prov_column}} == "Undersea Feature" ~ "Non Province",
      {{prov_column}} == "Yukon" ~ "YK",
      TRUE ~ "Non Canadian"
    ))
}

# Add provincial abbreviation column from provincial id value
create_prov_abbrv_from_census_num <- function(.data, column_name){
  .data %>%
    dplyr::mutate(PROV = dplyr::case_when(
      {{column_name}} == 10 ~ "NL",
      {{column_name}} == 11 ~ "PE",
      {{column_name}} == 12 ~ "NS",
      {{column_name}} == 13 ~ "NB",
      {{column_name}} == 24 ~ "QC",
      {{column_name}} == 35 ~ "ON",
      {{column_name}} == 46 ~ "MB",
      {{column_name}} == 47 ~ "SK",
      {{column_name}} == 48 ~ "AB",
      {{column_name}} == 59 ~ "BC",
      {{column_name}} == 60 ~ "YK",
      {{column_name}} == 61 ~ "NT",
      {{column_name}} == 62 ~ "NU",
      TRUE ~ "Non Canadian"
    ))
}

# Change case of sf geometry column
change_sf_geometry_case <- function(df){
  names(df)[names(df) == "geometry"] <- "GEOMETRY" 
  attr(df, "sf_column") <- "GEOMETRY"
  df
}

Source data

Location data, also known as geospatial data, is a specific type of data that stores geographic coordinates for use in analysis and visualizations. There are two main types of location data: vector and raster. This blog post will leverage vector data, which consists of points, lines, or polygons.

For this blog, we will use point and polygon vector data, specifically Canadian postal code (points) and Census Canada economic regions (polygons). For more information on the data sources, see the two links below.

# Open source Canadian postal codes
# Download, import, and transform data
can_postal_codes <-
  readr::read_csv(
    "https://raw.githubusercontent.com/ccnixon/postalcodes/master/CanadianPostalCodes.csv"
  ) %>%
  janitor::clean_names("all_caps") %>%
  create_prov_abbrv_from_census_num(FSA_PROVINCE) %>%
  dplyr::select(FSA,
                POSTAL_CODE,
                PLACE_NAME,
                PROV,
                AREA_TYPE,
                LATITUDE,
                LONGITUDE) %>%
  sf::st_as_sf(coords = c('LONGITUDE', 'LATITUDE'),
               crs = 4326) %>%
  sf::st_make_valid() %>%
  change_sf_geometry_case()

# Statistics Canada economic regions data
# Download, import, and transform source zip file data
download.file(
  "https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/ler_000b21a_e.zip",
  destfile = "/data/ler_000b21a_e.zip"
)

# Unzip file
system("unzip /data/ler_000b21a_e.zip")

# Import and transform data
can_eco_regions <- sf::st_read("~/R/data/ler_000b21a_e.shp") %>%
  janitor::clean_names("all_caps") %>%
  create_prov_abbrv_from_census_num(PRUID) %>%
  sf::st_transform(crs = 4326) %>%
  sf::st_make_valid() %>%
  dplyr::select(ERUID, ERNAME, PROV) %>%
  change_sf_geometry_case()

Data exploration

We will use two datasets for this analysis, Canadian postal code points and Canadian Census economic polygons. Both datasets have location information and have undergone data wrangling to standardize their attributes. Let’s take a quick look at the volume of each of the datasets.

Dataset Number of Columns Number of Rows
Postal codes 6 889,291
Economic regions 4 76

The postal code point dataset has the largest volume of data distributed across Canada. The following table outlines the number of postal code points in each province and territory.

Province Number of Points
AB 86,977
BC 132,827
MB 27,833
NB 59,563
NL 11,344
NS 26,924
NT 604
NU 21
ON 295,166
PE 3,487
QC 218,745
SK 24,711
YK 1,089

While the volume of data in the polygon dataset is significantly smaller compared to the points layer, it should be noted that each polygon covers a large geographic area. The number of economic regions ranges from 1 to 17, depending on the province or territory.

Province Number of Polygons
AB 8
BC 8
MB 8
NB 5
NL 4
NS 5
NT 1
NU 1
ON 11
PE 1
QC 17
SK 6
YK 1

The postal code points are not evenly distributed across the economic regions. Areas with larger populations and urban development have a greater number of postal codes, as illustrated on the map of Nova Scotia, where a high volume of points are positioned in two main areas.

Spatial intersection

A typical location data wrangling task is to pull attribute information from one location dataset based on the geographical position of another location dataset. To achieve this, we leverage the st_intersection() function from the sf package, allowing us to extract attribute information in the economic regions polygon to be added to the postal code point layer.

The approach of the st_intersection() function is to have each point “check” to see if it geographically intersects with a record in the polygon layer. This approach is challenging when you have large volumes of data that span a large geographic region, like the case in Canada. The standard approach with the st_intersection() function is to run the function with the whole point and polygon layers, which results in a lot of needless requests between the point and polygon layers (i.e., points in Nova Scotia checking if they geographically intersect with polygons in British Columbia). What if a common geographic area could separate the points and polygons, reducing the volume of the data and allowing for chunks to be processed simultaneously on different computer cores? In theory, this is a fast approach.

To test the theory, let’s select a sub-sample of the postal code point layer and run the st_intersection() function in three different approaches:

  1. All points and all polygons (the standard approach)
  2. Geographic region of points and all polygons (province of points and Canada-wide polygons)
  3. Geographic region of points with the same geographic region of polygons
can_postal_codes_sample <- can_postal_codes %>%
  dplyr::sample_n(100000)

ns_sample_pnts <- nrow(as.data.frame(can_postal_codes_sample) %>%
                         dplyr::filter(PROV == "NS"))

# Approach 1
sf::st_intersection(can_postal_codes_sample,
                    dplyr::select(can_eco_regions, ERUID, ERNAME))

# Approach 2
sf::st_intersection(
  dplyr::filter(can_postal_codes_sample, PROV == "NS"),
  dplyr::select(can_eco_regions, ERUID, ERNAME)
)

# Approach 3
sf::st_intersection(
  dplyr::filter(can_postal_codes_sample, PROV == "NS"),
  dplyr::select(dplyr::filter(can_eco_regions, PROV == "NS"), ERUID, ERNAME)
)

This test should help us understand if our theory of simultaneously processing chunks of data (aka distributed processing) reduces data wrangling time. Using the sample_n() function from {dplyr}, we randomly select 100,000 postal code points for use in Approach 1, and ns_sample_pnts() points being used in Approach 2 and Approach 3, which result in the following processing times:

Point layer 100,000 randomly selected points across Canada
Polygon layer 76 polygons across Canada
Processing time 6.6 minutes


Point layer 3,026 randomly selected points across Nova Scotia
Polygon layer 76 polygons across Canada
Processing time 2.5 minutes


Point layer 3,026 randomly selected point across Nova Scotia
Polygon layer 5 polygons across Nova Scotia
Processing time 1.2 seconds


The results from the test validate the theory that processing points and polygons separated geographically is significantly faster than the alternative approaches.

Distributed processing with R

Now, let’s distributively process the Canada-wide postal codes with the Canada-wide economic regions.

There are several different options for distributed processing within R, but we will leverage the {future} and {furrr} packages for this story. One of the advantages of using {furrr} is that it follows the tidyverse syntax, with a few of {furrr}’s functions looking very similar to {purrr} functions. This approach provides ample power without the need to learn new programming languages, frameworks, or additional computer resources.

To take advantage of distributed processing, we need to prepare the data and select the distributed methods before performing the processing. To distribute process the Canada-wide postal codes and economic regions, we do the following four things:

  1. Break data into smaller chunks
  2. Identify the distributed method
  3. Perform the processing
  4. Combine the results
# Split Data by Province
# Postal codes -- points
can_pcodes_split <- can_postal_codes %>%
  dplyr::filter(PROV != "Non Canadian") %>%
  dplyr::group_by(PROV) %>%
  dplyr::group_split()

# Economic regions -- polygons
can_eco_regions_split <- can_eco_regions %>%
  dplyr::group_by(PROV) %>%
  dplyr::group_split()

# Distribute the processing
future::plan(multisession, workers = 8)

furrr::future_map2(can_pcodes_split, can_eco_regions_split, sf::st_intersection) %>%
  dplyr::bind_rows()


Point layer 889,291 point across Canada
Polygon layer 76 polygons across Canada
Processing time 3.6 minutes


Our findings

The results from the analysis were eye-opening. Breaking the data into smaller chunks allowed us to process our data simultaneously, significantly decreasing processing time. The Canada-wide distributed processing was completed in 3.6 minutes, almost two times faster than the 100,000 randomly sampled postal code points processed using traditional methods. Using the sample processing time to estimate Canada-wide, non-distributed processing time results in 58.3 minutes, which would be 16x longer than the distributed processing time!

Summary

Data wrangling is an unavoidable part of any data project, occupying a significant portion of time and resources. Reducing the effort during this stage allows more time to be allocated during analysis and modelling. Distributed processing is an option that can be implemented in certain data-wrangling situations that can seriously reduce processing time, not just in geospatial applications but in non-location data as well. With most, if not all, new computers being equipped with multiple processors, distributed processing becomes a low-cost solution that many data teams can employ to improve their data science pipelines.